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ABSTRACT 

Increasing data volumes delivered by a new generation of radio interferometers require 
computationally efficient and robust calibration algorithms. In this paper, we propose 
distributed calibration as a way of improving both computational cost as well as 
robustness in calibration. We exploit the data parallelism across frequency that is 
inherent in radio astronomical observations that are recorded as multiple channels 
at different frequencies. Moreover, we also exploit the smoothness of the variation of 
calibration parameters across frequency. Data parallelism enables us to distribute the 
computing load across a network of compute agents. Smoothness in frequency enables 
us reformulate calibration as a consensus optimization problem. With this formulation, 
we enable flow of information between compute agents calibrating data at different 
frequencies, without actually passing the data, and thereby improving robustness. 
We present simulation results to show the feasibility as well as the advantages of 
distributed calibration as opposed to conventional calibration. 

Key words: Instrumentation: interferometers; Methods: numerical; Techniques: in¬ 
terferometric 


1 INTRODUCTION 


Many of the science drivers in modern radio astronomy seek 
weak signals buried in noise and bright foregrounds. Exist¬ 
ing radio interferometers are upgraded and new ones are 
being built to deliver large volumes of data to achieve this 
goal. A major step of data processing in such telescopes is 
the correction of systematic errors and the removal of con¬ 
taminating foregrounds from the data, which is called cali¬ 
bration. With wide fields of view, calibration has to be done 
along hundreds of direction s in the sky, especially at low ra¬ 
dio frequencies dBregmanll2012l l. This entails solving for a 
large number of unknowns and a reliable solution can only 
be obtained if there are sufficient constraints. 


There are several novel calibration algorithms 
(|Kazemij3taJj|201l]; iKazemi fe Yatawattall2013l : lYatawattal 


120131: lTasseTl2014l) that are presently being used that 


improve sp eed and robustness i n calibration. Most of these 
algorithms iKa zemi et, al.ll201ll : IKazemi fc Yat,awattall2013l : 
lYatawattal 120131 ) use an algebraic data model that directly 
solve for Jones matrices representing the cumulative effect 
of the systemati c errors. On the other hand , solving for a 
physical model iBregmanl 120121 ; iTassc 2014) would reduce 
the number of unknowns, especially since most of the 
systematic errors are known to have a smooth variation 
across frequency. One drawback of the physical model 
based calibration is the need to access data across a wide 
frequency range, which is computationally not feasible at a 
central location given that there are thousands of frequency 


channels in the data. Moreover, a physical model requires 
an accurate description of the frequency dependence and 
this can only be done for specific and well studied errors. 
Therefore, in this paper we propose a distributed calibration 
scheme that preserve the simplicity and computational 
speed of algebraic model based calibration while enforcing 
the smoothness of the calibration parameters across fre¬ 
quency. This can be thought of as getting the best of both 
aforementioned calibration approaches. In order to do this, 
we reformulate calibration as a distribute d opt im izatio n 
problem and use consensus optimization (iBovd et al.ll20lil l. 


Distributed learning and distributed optimization 
(I T sitsiklid 1 1 9841 : iBertsekas fe Tsit siklid [l~9 9 7lJ is a widely re- 
search e d topic in v arious disciplines. Consensus optimization 
(IBovd et al.ll2011f ) is one algorithm for distributed learning. 
In the era of exascale computing and big data, the impor¬ 
tance of such algorith ms grow ever more ( f or some recent re¬ 
sults se e for instan ce Chang et alJ (12014 1: IWei fe Ozdaelarl 
<|2012lJ : iMota et al.l (120131 11. Instead of one compute agent 
accessing data across all frequencies (which is computation¬ 
ally unfeasible), we consider a situation where a group of 
compute agents accessing data across smaller frequency in¬ 
tervals. This matches ideally with how radio astronomical 
data is organized (data for the full observing bandwidth is 
divided into channels and channels are grouped into sub¬ 
bands), and stored. Therefore, we consider a situation where 
each compute agent having access to only a few subbands 
(while the full bandwidth consists of a few hundred sub¬ 
bands). Each compute agent will calibrate the data available 
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locally (using an algebraic model) and the calibration solu¬ 
tions are transferred to a centralized location (fusion center). 
At the fusion center, consensus on the smoothness of the 
parameters across frequency is enforced. Afterwards, this 
update is passed back to each compute agent. Therefore, 
indirectly, each compute agent receives information across 
the whole frequency range, thus improving the calibration. 
Moreover, since no attempt is made to directly model or 
estimate underlying physical parameters, the calibration al¬ 
gorithms are simpler and less susceptible to model errors. 
Furthermore, the amount of information that needs to be ex¬ 
changed between the fusion center and the compute agents 
is much less compared to the amount of data being cal¬ 
ibrated, making this scheme computationally feasible. We 
also note that similar approaches have been proposed a nd 
tested for radio astronom ical image synthesis dFerrari et al.l 
12014 ICarrillo et al .1 [20~i~il) to reduce the number of Fourier 
space samples used in imaging as well as to improve the 
quality of reconstruction. Such imaging approaches would 
certainly complement the calibration approach proposed in 
this paper. 

The rest of the paper is organized as follows: In sec¬ 
tion 0 we introduce radio interferometric calibration and 
in section [3] we reformulate it as a distributed consensus 
optimization problem. We give results based on simulations 
in section [I] to show the feasibility and superiority of the 
proposed scheme and draw our conclusions in section [5] 

Notation: Lower case bold letters refer to column vec¬ 
tors (e.g. y). Upper case bold letters refer to matrices (e.g. 
C). Unless otherwise stated, all parameters are complex 
numbers. The set of complex numbers is given as C while the 
set of real numbers as R. The matrix pseudo-inverse, trans¬ 
pose, and Hermitian transpose are referred to as (.)', (.) T , 
(,) H , respectively. The matrix Kronecker product is given by 
0. The identity matrix is given by I. Estimated parameters 
are denoted by a hat, (.). The Frobenius norm is given by 
||.||. A uniform distribution in [0,1] is given as U(0, 1). 


2 RADIO INTERFEROMETRIC 
CALIBRATION 

In this section we give a brief overview of the data model 
used in radio interferometri c calibration (Hamake r et al.l 
1 19961 : I Thompson et al.l 120011 1. We consider the radio fre¬ 
quency sky that is part of the sky model to be composed 
of discrete sources, far away from the earth such that the 
approaching radiation from each one of them appears to be 
plane waves. However, in reality there is large scale diffuse 
structure as well. There are N receiving elements with dual 
polarized feeds in the array and at the p-th station, this 
plane wave causes an induced voltage, which is dependent 
on the beam attenuation as well as the radio frequency re¬ 
ceiver chain attenuation. Consider the correlation of signals 
at the p-th receiver and the g-th receiver, with proper sig¬ 
nal delay at frequency / and time t (with finite bandwidth 
and integration time). After correlation, the correlated sig¬ 
nal of the p-th station and the g-th station (named as the 


visibilities), \/(p,q,t, /) (G C 2x2 ) is given by 


K 

V(p, q, t,f) = Yl J (P’ fe ’ /) C (f’ <?, k , /)•%, k , t, f) H + Npq. 

k = 1 

(1) 

In JT|, J(p, k, t , /) and J(g, k, t, f) are the Jones matrices de¬ 
scribing errors along the direction of source k, at stations p 
and g, at time t and frequency /, respectively. These ma¬ 
trices represent the effects of the propagation medium, the 
beam shape and the receiver. There are K sources in the sky 
model and the noise matrix is given as N P9 (G C 2x2 ). The 
contribution from the fc-th source on baseline pg is given by 
the coherency matrix C(p, g, k,t, f) (G C 2x2 ). We consider 
the sources in the sky model to be unpolarized and for the k- 
th direction, with intensity J(p, g, k, /) (invariant over time 
but dependent on p,g if the source is resolved) we have 


C(p, g, k,t, f) 




i(p,q,k,f) 0 

0 I(p, g, k, f) 


( 2 ) 

where <j)(p, g, k, t, /) is the Fourier phase component that 
depends on the direction in the sky as well as the separation 
of stations p and g and can be exactly calculated. Moreover, 
it is also possible to refine C(p ,q,k,t, f) to include finite 
integration time and bandwidth iThompson et af]|200ll 'l but 
in this paper we use the simpler form. The noise matrix 
N P9 is normally assumed to have elements with zero mean, 
complex Gaussian entries with equal variance in real and 
imaginary parts but the statistics w ill vary beca use of the 
unmodeled structure (IKazemi fe Yatawattall2013l h 

Calibration is the estimation of a set of parameters 6 
that describe the Jones matrices J (p,k,t,f) for p G [1,IV] 
and k G [1, K ] for given t and /. The solutions obtained are 
additionally used to correct the data and also to calculate 
the residual by subtracting the predicted model from the 
(corrected) data. The maximum likelihood (ML) estimate 
of 6 under zero mean, white Gaussian noise is obtained by 
minimizing the least squares cost function 


5 , ( 0 ) = 2I5Zll V (P> «>*>/) (3) 

t,f P,q 

K 

- Jfa, k, t, /)C(p, g, k, t, /)J(g, k, t, f) H || 2 

k=l 


and can be improved by using a weighted least squares 
estimator to accoun t for errors in the sky model 
dKazemi fe Yatawattall2013lh At this point, we make several 
points clear and make certain assumptions: 


• The solutions 0 are assumed to be invariant over time, 
within the time interval g(6) is minimized, therefore from 
now on, we drop the time dependence from J(p, k , t, /) and 
use J(p, k, /) instead. This can also be done for C(p, g, k, t, f) 
to have C(p, g, k, /) and V(p, g, t, f) to have V(p, g,/), be¬ 
cause the geometry of baseline pg is dependent on t (also 
the summation over t is implicitly assumed but not explic¬ 
itly stated). 

• The solutions for different directions k are assumed 
to have statistically independent noise, therefore, we use 
expectation maximization (EM) a nd space alternating 
expectation maxim ization (SAGE) (Fessl er fe Herd Il994l : 
IKazemi et al.ll201lh to simplify the cost function in © over 
summation in k. 
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• We do assume variability of the solutions over /, indeed, 
this is the novelty of this paper. For some directions, we 
can assume a smooth vari ation of the underlying parameters 
over / as done in dTassd 20L41. However, the drawback of 
the approach taken in ( Tassel 2014h is the amount of data 
needed to get a reliable estimate of this variation over /. 
Indeed for a telescope like LOFAR that observe over a wide 
bandwidth, producing hundreds of subbands and thousands 
of channels of data, even storing all subbands at one location 
is problematic, let alone reading that data into memory. 


Therefore, in this paper, we reformulate calibration as a 
distributed optimization problem. We assume smoo t h var i- 
ation of the parameters over /, but unlike in d Tassel[2014 ). 
we do not directly estimate those underlying parameters. In 
other words, the smooth variation is imposed as an addi¬ 
tional constraint, but the calibration problem is still kept 
unchanged by estimating J(p, fc, f ) for each p, k and /. Note 
that since the storage of data is by default distributed over 
/, i.e. different subbands (channels) are stored at different 
locations, the optimization can also be done in a distributed 
way. This distribution of computations does not necessarily 
reduce the total computational cost, but it can reduce the 
computational cost required at any one location where data 
is stored, provided that the computations only access the 
data that is locally available. In the following section, we 
describe how this can be done. 


3 DISTRIBUTED CALIBRATION 


Consider the Jones matrices along the fc-th direction, 
J(p, fc,/), for N stations, let 


‘ J(1 ,*,/) 1 

A -K2, fc,/) 

J k f = 

. J (N,k,f) _ 


(4) 


where J kf (E C 2 x2 ) is the augmented Jones matrix. Also 
define a canonical selection matrix A p (eC 2x2iv ) 

Ap = [0, 0,..., I,..., 0]. (5) 


where all elements of A p are zero except the p-th block which 
is an identity matrix. Using A p and J kf, we can recover 
J (P,k,f) = A p Jfc/. 

The ML estimate for 6 can ideally be obtained by min¬ 
imizing ©, but this needs access to all data. In normal cal¬ 
ibration, solutions are obtained separately for each /, using 
data at that frequency. For given /, consider partitioning the 
parameters as { G k f : k = 1... K}. We apply the EM/SAGE 
algorithm (Ka zemi et al.l [201 ill to estimate each 0 k f- The 
expectation step in SAGE finds the visibility contribution 
Vpqkf from V(p, q, f) (with C pqkf = C (p,q,k,f)) as 

V P qfc/ = V(p, q, f) — y A p JifCpqif (A 9 J if) (6) 

i,i^k 

and using this, in the maximization step, the current esti¬ 
mate for 9kf is obtained by minimizing 

9kf{0kf ) = y ] II Vpqkf — A p Jfc/C p qfc/ (A,J kf) || ■ (7) 

p,q 


Now, to simplify the description even further, we only 
consider calibration along the fc-th direction or minimizing 
9kf(9kf), so we drop the subscript fc. Let G k f = J kf = J / 
where J / is defined as in ©• Thereafter, we have the sim¬ 
plified form for © 

9fCf) = l|VpqJ ~ A p J/C pg / (AqJ/) || . (8) 

p,i 

So far, we have not imposed the smoothness over / to J/, 
in order to do that, we introduce hidden variables Z i (£ 
C 2iVx2 ), l £ [1,E], and we enforce the relationship 

J f = 'E b ‘(f)Z l (9) 

i 

onto J/. In (|TJ|) . the only frequency dependence on the right 
hand side is introduced by real scalar values bi(f) that can 
be thought of as polynomial terms (in /). The order of 
the polynomial is F — 1 (where F > 1) and this controls 
the smoothness. For instance, given reference frequency /o, 

we can select 6((/) = ( ~j ~ j , but this is one possible 
polynomial and we can use more sophisticated expressions 
if needed. 

If 6/ (e R Fx ) is a vector representing all polynomial 
terms 


b f = [bi(f) b 2 (f) ... M/)] t 

(10) 

we can rewrite © as 


N 

e *- 1 

CO 

II 

N 

CM 

(8) 

II 

(ii) 

where Ljv is the 2 N x 2 N identity matrix, B f = (bj ® Ljv) 
(G R 2JVx2FJV ) and Z (£ c 2FiVx2 ) is the augmented matrix 
of hidden variables 

z = [zT z F ... z t f ] t . 

(12) 


For each direction fc, by imposing smoothness, we can find 
a set of hidden variables Z for any given value for F. At this 
point, we distinguish between direct estimation of Z and the 
method proposed in this paper: 

• Centralized calibration is estimating Z directly from the 
data. However, this requires access to all frequencies (or at 
least a set of frequencies more than F) as shown in Fig. [T] 
(a). More rigorously, centralized calibration is estimating Z 
such that 3/(J/) is minimized. As we explained before, 
this is computationally not feasible because of the large data 
volumes needed. 

• Instead of centralized calibration, we formulate dis¬ 
tributed calibration as follows. Let there be P computational 
agents (or nodes) in a network. We assume the *-th agent 
will only have access to the data at frequency fi as in Fig. 
□ (b). However, we enforce consensus among all agents, in 
other words, we enforce an additional constraint J/ 4 = B^Z 
that all agents have to satisfy. Note that the total num¬ 
ber of frequencies that the data is taken will almost surely 
be higher than P. In that case, we consider calibration of 
a subset of P frequencies selected from the total available 
frequencies. This selection has to be repeated sequentially 
until all the frequencies are calibrated. 

With this network setup, we formulate distributed cal- 
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Figure 1. Centralized calibration compared with distributed cal¬ 
ibration. (a) Centralized calibration requires access to data ob¬ 
served at multiple frequencies, (b) Distributed calibration uses 
agents that operate on data taken at only a single frequency but 
via a fusion center, information is passed to other agents operating 
on data at different frequencies. The exact functions minimized in 
centralized calibration gi(S) and distributed calibration are 
slightly different. In normal calibration, each agent in (b) operate 
independently without communicating with the fusion center or 
any other agent. 


ibration as 

{J/i,J /2 ,...,Z} = arg min (13) 

J% ’ ’ z 

subject to J f i = B/,Z, i € [1,P] 

which is actual ly a consensus optimization problem 
rtBovd et alil20lJ) . To solve this, we use the augmented La- 
grangian method with the Lagrangian 

/•(J/,_ ■ J .Z. Yf,. Y, 2 ....) (14) 

= + II Y h( J h - B A Z )II + f ||J / 4 - B^ZII 2 

z 

= E Li (J/ 4 ,Z,Y /4 ) 

z 


where Y f t are the Lagrange multipliers and p is the regu¬ 
larization factor. In order to solve d, we use the consen- 
sus alternating di rection method of multipliers (C-ADMM) 
dBovd et al.il20lil 'l. If superscript n denote values at the n-th 
C-ADMM iteration, the values for the (n + l)-tli iteration 


are updated as 

(J h) n+1 = min U (J, (Z)", (Y/J”) 
(Z)" +1 = min £ U ((J /t r +1 , Z, (Y /4 d 

i 

(Y h ) n+1 = (Y/ i )" + p((J/J n+1 ^ B fi (Z) n+1 ). 


(15) 

(16) 

(17) 


The minimization CD3) has no closed form solution and 
needs to be done iteratively, for instance by using 
the Broyden-Fletcher-Go ldfarb-Shanno (BFGS) algorithm 
iNocedal fe Wright! 119991) or by using trust-region algo¬ 
rithms. In this paper, we use the Riemannian trust-region 
algorithm (RTR) described in I Absil et al.l d2007l ) for this 
minimization, and we need to calculate the gradient and 
the Hessian. The gradient and_the Hes sian w ith respect to 
J/ of (1151) are given as ( see lYatawattal d2013l ) and appendix 
0 for proof) 

grad^L,, J) = grad, (g fi (J), J) + 1y /( + f (J - B fi Z) (18) 
and 

Hessi(Li, J, rj) = Hessi (g fi ( J), J, rj) + 0 + ^r) (19) 

where we use © to get 

g rad i(s , / i (J),J) ( 20 ) 

= - E ~ A P JC P , /4 J H Aj')A,JCf, /4 

p,q 

+ Ag (Vpqfi — ApJCpg^, J Ag ) ApJQpqf^ 

and 


Hessi (fl/ 4 (J), J,r/) 

= E ( A p (( V P9/> - A p JCpg/ 4 J h A^ )A q rj 

p,q 


-A p (JC pqfi ri H + T7C P g/ i J if )A^Agjj C pqf . 

+Ag ((Vpg /4 - Ap]Cp qh ] H A T q ) H A pV 
-A,(JC p , / 4 »j h + r,C pqf J H ) H A F A p j) C pqfi ) . 


( 21 ) 


Minimization of HU can be done in closed form. We 
take the derivative to get 

grad(L, Z) = E B£ (-Y, 4 + p(-J/ 4 + B, 4 Z)) (22) 

z 

and equating this to zero gives us 

which can be further simplified by using B/, = (t> F ® Eat) 
to get 

E b * ® 

z 

(24) 

Each column of Z in (1241) can be written as2 = (P®l 2 iv)r’, 
where P = j (£ i b/ i b/J t (€ R FxF ) and r £ c 2FNxl is 
the corresponding column of the right hand sum of (EH). 
We can reshape r to get R £ C 2 xF . Then we can rewrite 
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z = vec (j2ivRP T j = vec ^RP T j which is far simpler to 
obtain than directly solving (I24h . Moreover, we see that from 
(1241) in order to have full rank, the summation ^ . bf . b'j. 
should at least have F terms (because the size of bf. bf. is 
F x F). In other words, we need to have data for at least F 
different frequencies, or P > F. 

To recapitulate, we consider P compute agents operat¬ 
ing simultaneously. Each agent i only has access to the data 
at frequency /,. There is also a data fusion center with which 
each agent does communication. Consider calibration along 
a single direction first. Each agent i needs to estimate J f i 
(G <C 2JVx2 ) and will keep auxiliary variable Y/ 4 (G C 2iVx2 ) 
locally. The fusion center will keep the global variable Z 
(G C 2FJVx2 ) and will pass the product B/,Z (G C 2JVx2 ) onto 
the i-th agent. With this additional variables, the C-ADMM 
algorithm for a single direction (K = 1) can be described 
as: 

(51) Each agent i finds estimate for J f i by solving (1151) . 
Thereafter, it sends back the result Y f t + pj/. to the fusion 
center. 

(52) At the fusion center, after collecting the values Y/ ; + 
pj f,- from all agents, (1161) is minimized by solving (12411 . Once 
the updated Z is obtained, B/Y is sent back to the i -th 
agent. 

(53) At agent i, Y f i is updated by using (1171) with the 
new value of B/ ; Z received from the fusion center. If stopping 
criteria (such as the maximum C-ADMM iterations) are not 
met, we go back to (SI) above. 

Note that steps (SI) and (S3) above are done simultane¬ 
ously at each agent. The centralized step (S2) is only an 
averaging step which is far less expensive compared with 
the minimization in (SI). 

The above description is only for calibration along a 
single direction. In order to apply the same method for cal¬ 
ibration along K directions, we only need slight modifica¬ 
tions to the steps described above. We use subscript k to 
indicate the fc-th direction. Each agent i needs to estimate 
K values J kf t - Moreover, each agent has K auxiliary vari¬ 
ables Y kf t - The fusion center keeps the global variables Z 
(G C 2KFNx2 ) which has I\ blocks (let us denote the fc-tli 
block of Z as (Z)*,), one for each direction. Therefore, we 
have the C-ADMM algorithm for K directions as: 

(Dl) Each agent i finds estimate for K values J kf t - This 
is done by decomposing the K direction problem onto K 
problems of th e type (1151) . In order to do this, we use SAGE 
algorithm dKazemi et al.l[201ll l. Note that in SAGE algo¬ 
rithm, we need to calculate the conditional mean of the data 
for each direction (expectation step) and we calculate this 
ignoring the auxiliary variables and the regularizing term. 
However, in the maximization step of the algorithm, we solve 
<na> with full regularization. Thereafter, it sends back the 
results Ykfa F pRf, to the fusion center (K values). 

(D2) At the fusion center, The block matrix Z is updated 
by solving 01 for I\ blocks separately. Thereafter, with the 
updated Z, B^ (Z)*, is sent back to the i-th agent (K values). 

(D3) At agent i, Y kf t for I\ values are updated using 
m and if stopping criteria are not met, we go back to (Dl) 
above. 

The initialization for the C-ADMM algorithm is done 


as follows. First, the initial values for J kf t can be taken as a 
block matrix of 2 x 2 identity matrices (or for a warm start, 
we can take the solutions from the previous time slot). The 
initial values for both Z and Y kf t are taken as 0. Because of 
this, the solutions obtained for J kf t at the first C-ADMM 
iteration for step s (SI) and (Dl) will have an unknown uni¬ 
tary ambiguity dYatawattal l2012~ah . Therefore, only at the 
first iteration, the averaging step in (S2) and (D2) should 
be done after projecting each J kf t to the mean value cal- 
culated using the quotient manifold structure described in 
dYatawatta] |2012al. For the remaining iterations, because Z 
and Ykfi are not 0, the unitary ambiguity will be common 
(for each direction) and normal Euclidean averaging can be 
done in steps (S2) and (D2). 

The selection of the regularization parameter p (> 0) 
is specific to each problem and more detail can be found 
in IBovd et al.l (|20 11). We note here that it is possible to 
select different values of p for different directions when I\ 
directions are calibrated. For instance, for source clusters 
that are far away from the phase center, it might be true 
that there will not by any smooth variation of the errors with 
frequency along that direction. Therefore, for that specific 
direction, we can make p very small (so no smoothness is 
enforced). On the other hand, for source clusters at the phase 
center (also at the center of the beam), we can safely assume 
that the errors vary smoothly with frequency and use a high 
value for p (typically G [1, 10]). In section 2J we provide 
simulations where we have varied the value of p and see how 
the performance change. 

The convergence of distributed calibration is discussed 
in appendix m in detail. This boils down to having a sky 
model with finite, non-zero flux and data with finite values, 
but the true sky can have zero flux, and then the solutions 
will be zero. Convexity of the c ost functions are also de¬ 
sired for C-ADMM to converge (Boyd et al.l l201lh and for 
an interferometric data model, this generally is assumed to 
hold. 

The amount of information that needs to be exchanged 
between the i-th agent and the fusion center is K x 2 N x 2 
(complex variables). In contrast, the amount of observed 
data used in calibration at the i-th agent is of the or¬ 
der N(N - l)/2 x 2 x 2 xT for T time samples with 
N(N — 1)/2 baselines. Therefore, when working with P fre¬ 
quencies, the total amount of data that needs to be accessed 
is T(N — l)/2 x 4 NP and for K directions, the total amount 
of information that needs to be exchanged in distributed cal¬ 
ibration is IT x ANP. Hence, when K <C T(N — l)/2, the 
amount of information passed is much less in distributed 
calibration, regardless of the value of P. 

The total number of computations in distributed cali¬ 
bration compared to normal calibration is not significantly 
different, and in fact it could even be higher. However, we 
gain a significant reduction in operational and energy cost 
by being able to distribute the total computations across a 
network of compute agents. In addition, there are several 
possibilities to reduce the computational cost even further 
and these will be explored in future work. First, it is possi¬ 
ble to eliminate the n eed for having a fusion center llErseghel 
l2012l: ISM et a l. inn and design an algorithm where agents 
only pass data between their neighbours. Secondly, when 
there are data with more frequencies than the number of 
compute agents, a multiplexing scheme where each agent 
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alternates the data used in calibration, and yet calibrates 
the full dataset can be investigated. 


4 SIMULATIONS 

In this section, we present several simulations to illustrate 
the performance of distributed calibration. We consider a ra¬ 
dio telescope similar to LOFAR, observing in the frequency 
range 115 MHz to 185 MHz with N = 47 stations pointed at 
the north celestial pole (|Yatawatta et al1l201.ll ). We consider 
data taken at P = 32 different channels (with bandwidth 0.2 
MHz each) uniformly spaced in frequency within the observ¬ 
ing frequency range. Therefore the frequency range covered 
by the data is 70 MHz wide but the actual bandwidth is 6.4 
MHz. In a typical situation, in order to increase the number 
of constraints, calibration is performed for about every few 
minutes of data, using more than 1 time sample (for instance 
with 10 s integration, we have 30 samples for 5 minutes of 
data). In our simulations we only use 20 time samples in 
all calibration tests, equivalent to a total integration time of 
200 s. Note that we call calibration of individual channels 
separately (without using the information across frequency) 
as normal calibration throughout this section. 

We simulate 0 and the Jones matrices J(p, k, t, f) are 
simulated as follows. The variation with t is simulated as 
sin(ait' + 27r/3i) + ysiii^t' + 2 ^ 2 ) where an, ai, /3i, P 2 are 
drawn from a uniform distribution U{ 0 , 1 ) for each station 
p and direction k, and t' is time sample number. The vari¬ 
ation across frequency is simulated by using a polynomial 

Xw=i( 7 z + jSi) ( J fg° J where 7 ;, Si are also drawn from 
a uniform distribution U( 0,1). The reference frequency fo 
is taken to be 150 MHz and G = 4 for all simulations. The 
product of the time variation and frequency variation gives 
the complete description of the elements in J(p, k, t , /). 

The sky model consists of point sources, randomly dis¬ 
tributed over a held of view of 7x7 square degrees. Their 
intensities at frequency fo is simulated using a power law 
and their spectral indices are drawn from a uniform distri¬ 
bution U{— 1, 1). The flux of the weakest source calibrated 
is set to be 1 Jy. The number of sources K simulated (and 
calibrated) is varied for different simulations as described 
below. I 11 addition, we also simulate a set of 300 weak back¬ 
ground sources, with peak flux below 0.1 Jy and have a hat 
spectrum. We do not corrupt these sources with errors be¬ 
cause we want to examine the effect of calibration of the 
bright foreground sources on them. 

Finally, we add noise N P9 to the simulated visibilities 
in 0. The elements of N P9 are drawn from a Gaussian dis¬ 
tribution with zero mean and equal variance in real and 
imaginary parts. The noise variance is adjusted such that 
the total noise power is 10 % of total signal power for the 
full observation, which is 6 hours. 

4.1 Simulation I 

We consider calibration along one direction K = 1. For nor¬ 
mal calibration, we use 30 iterations of the RTR algorithm 
(beyond which we do not see any improvement). For dis¬ 
tributed calibration, we use 50 C-ADMM iterations. Each 
C-ADMM iteration performs steps (S1,S2,S3) described in 



Figure 2. Variation of the primal residual with C-ADMM it¬ 
eration number, for two values of smoothing polynomial terms 
F = 2 and F = 5 and three values of regularization factor 
p = 0.5, p = 5 and p = 50. 

section 0 and step (SI) uses 10 iterations of the RTR al¬ 
gorithm. Let the simulated Jones matrices 0 at frequency 
/ be given by J/ and its estimated value be given by J/. 
Then the error between J f and J f (per parameter) is found 
as -A=||Jy — JyU11 where U (£ C 2x2 ) is a unitary matrix de¬ 
noting the unitary ambiguity between the true parameters 
and estimated para meters . It is found by solving a matrix 
Procrustes problem dYatawattall2012ah . We average the er¬ 
ror calculated this way over the P frequencies and all time 
samples to get the final error. 

Moreover, we use two measures of error to study the 
convergence of distributed calibration. We define the ’pri¬ 
mal’ residual as -^=||(J/ 4 ) n — B(Z)^11, averaged over all 
fi. The ’dual’ residual is defined as ||(Z) n+1 — (Z)™||, 

where the superscripts n + 1 and n denote the C-ADMM 
iteration number. The primal residual depicts the error be¬ 
tween the local solution and the predicted consensus value. 
On the other hand, the dual residual depicts the convergence 
of the global variable Z. 

In Fig. [2] we have shown the variation of the primal 
residual and in Fig. [3] the variation of the dual residual, 
both with the C-ADMM iteration number. The regulariza¬ 
tion parameter is set at p = 0.5, p = 5 and p = 50. The 
number of terms in the smoothing polynomial 01 is set at 
F = 2 and F — 5, with F = 2 underestimating the sim¬ 
ulated polynomial order while F = 5 overestimating it. It 
is clear that as the value of p increases, the primal residual 
converges faster, and to a lower value. Also, the dual resid¬ 
ual is lower for a low order polynomial, or a lower value of 
F. 

In Fig. U we show the average error per parameter for 
the chosen values of F and p, after 50 C-ADMM iterations. 
In all cases, distributed calibration gives a lower error than 
normal calibration. Even though the true parameters are 
simulated using a polynomial with G = 4, we get the lowest 
error for both F = 2 and F = 5, at p = 50. The lower bound 
of this error is determined by the noise and the weak sources 
not included in calibration. For this example, this bound is 
too high to see a difference in performance between F = 2 
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Figure 3. Variation of the dual residual with C-ADMM iter¬ 
ation number, for two values of smoothing polynomial terms 
F = 2 and F = 5 and three values of regularization factor 
p = 0.5, p = 5 and p = 50. 



Frequency/MHz 

Figure 4. Variation of average error standard deviation of the 
estimated solutions with frequency after 50 C-ADMM iterations. 
Normal calibration has higher error and distributed calibration 
with p = 50 gives the lowest error, both for F = 2 and F = 5. 
The edge frequencies have higher error for F = 5 due to our 
choice of the interpolating polynomial. 

and F = 5. Moreover, we also have errors due to polynomial 
interpolation, which is clearly seen for F = 5 at the edge 
frequencies. 

4.2 Simulation II 

In this simulation we set K = 25 and we use 20 time sam¬ 
ples in calibration, in other words, calibration is performed 
for every 200 s of data. Therefore, for a 6 hour observation, 
calibration is performed 108 times. We use F = 2 and p = 5 
and each calibration uses 20 C-ADMM iterations. In each C- 
ADMM iteration, there are 3 SAGE iterations. In Fig. 0(a) 
we show the uncalibrated continuum image which is domi¬ 
nated by the errors along strong sources. In Figs. [5] (b) and 
O (c) we show the calibrated image after normal calibration 
and distributed calibration, respectively. The noise (at the 
edge) in Figs. 0(a), (b), and (c) are 3.3 mjy, 0.64 mjy and 


0.49 mjy, respectively. Therefore, there is a clear reduction 
in noise with distributed calibration, although this is not 
visible in Fig. [5] 

In order to clearly show the difference, we have shown 
a small area of the full image in Fig. [6] where we show an 
area surrounding a bright source. The uncalibrated image is 
shown in Fig. |6](a) and images after normal and distributed 
calibration are shown in Fig. [6] (b) and Fig. [6] (c), respec¬ 
tively. It is clear that both normal and distributed calibra¬ 
tion does well in removing the source, and making the weak 
sources clearly visible. However, in Fig. [6] (b), there still is 
an error pattern at the location of the bright source. The 
magnitude of this error pattern is far below the noise floor 
of a single channel. Therefore, it is impossible to eliminate 
this error by normal calibration. However, as seen in Fig. 
[6] (c), distributed calibration does a much better job in re¬ 
moving this error pattern. This also explains the reduction 
of noise in Fig. [S] (c ). We see simila r error patterns in real 
observations (lYatawatta et all 120131 ). and with distributed 
calibration, the quality of images can certainly be improved. 


5 CONCLUSIONS 

We have proposed consensus optimization as a way of per¬ 
forming radio interferometric calibration in a distributed 
way. Distributed calibration enables us to improve the qual¬ 
ity of calibration as well as to distribute the overall com¬ 
putational cost. The aspect we used for consensus is the 
smoothness of calibration parameters over frequency. How¬ 
ever, a similar strategy can also be adopted to exploit spa¬ 
tial and temporal smoothness that can be explored in future 
work. We have given simulation results to confirm the fea¬ 
sibility of distributed calibration and also the expected im¬ 
provement in performance, for instance by avoiding converg¬ 
ing to local minima in the optimization. Future work would 
address better interpolation schemes that enforce consensus 
as well as multiplexing schemes when the number of fre¬ 
quency channels that needs to be calibrated is higher than 
the number of available compute agents. The source code 
for the algorithms described in this paper is available at 
http: / / sagecal.sf.net / 
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Figure 5. Continuum images of simulated data with K = 25 point sources with errors across a field of view of about 7x7 square 
degrees, (a) Raw image before calibration (b) Residual image after normal calibration (c) Residual image after distributed calibration. 
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Figure 6. A small area of Fig.[5]showing a bright source, (a) Raw image before calibration (b) Residual image after normal calibration 
(c) Residual image after distributed calibration. In both (b) and (c) the errors due to the bright source have mostly disappeared, but 
(b) has a spike like error pattern that is also removed in (c). 
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APPENDIX A: GRADIENT AND HESSIAN 
CALCULATION 

This section describes the derivation of the gradient and 
Hessian operators for (I5|i and mi s o tha t the Riemannian 
trust-region algorithm (lAbsil et alj l2007l l can be applied. 
Without loss of generality, we drop the superscript fi in this 
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section. We consider J to be on a matrix manifold denoted 
by M. Let the function to be minimized be g(J). We define 
the inner product for two elements in the tangent space TM 


of this manifold as 

HZ, V) = trac e(£ H ri + ri H £), £, iq € TM. (Al) 
With this definition, the gradient is calculated to satisfy 

HZ: grad(p(J))) = £>s(J)K], V? 6 TM (A 2 ) 

where 

£>S(J)K] = Ijm o g(J + T ^ ) ~ g(J) . (A3) 

Similarly, the Hessian of g{ J) can be obtained as 
Hess s(J)[rj] = lim - (grad g {J + riq) - grad p(J)). (A4) 

r —^0 T 

Now we can rewrite m as 

s(J) = (A5) 


y trace ( (\/ pq - A p JC pq (A 9 J) W ) 

P,<? ' 

x (v p9 -A p JC p9 (A 9 J) H )) 

and we can rewrite as 

H{ J,Z,Y) (A6) 

= g( J) + i trace (y h (J - BZ) + (J - BZ)"y) 

Retrace ((J - BZ) H (J - BZ)) . 

Finally, applying m and (El to El and El gives us 
m and m■ Moreover, by taking gradient with respect to 
Z, we also get (El. Note also that since we use the RTR 
algorithm in Euclidean space, the projection is n(J) = J and 
the retraction is R( J, rj) = J + rj. 


APPENDIX B: CONVERGENCE 

First, we consider the convergence of the RTR algorithm in 
minimizing a funct ion such as gr(J) in j 8 j with respect to J. 
Using (lAbsil et al.ll2008l . 7.4.6), we only need to show that 
the manifold on which J lies (say M) is compact (smooth¬ 
ness of g(J) is obvious). Given that the sky model has finite 
and non-zero flux and the data has finite power, we see that 
||J|| is finite and hence M is bounded. 

Each pair of p, q in (JHJ gives us a set of constraints on the 
values of J that can be expressed as a set of nonlinear func¬ 
tions g Pq ,ij (■, ,....) = 0 for different values of p, q, i,j, which 
are actually mappings from R 8JV to R. Since 0 is a closed set, 
elements of J are in the inverse image of g Pq ,ij(-, ,-•■■) = 0 
which is also a closed set. Note that in order to have expres¬ 
sions such as g Pq ,ij{-, ,....) = 0 that are unique, we need to 
have at least few of them with nonzero values for C pq and 
unique uv coordinates. In other words, we need to have a 
sky model with non zero total flux. For all possible values of 
p,q,i,j, with unique g Pq ,ij[-, ,....) = 0 expressions, we get 
an intersection of closed sets within which the elements in J 
must lie. Since the intersection of closed sets gives a closed 
set, we see that M on which J lies is also closed. Therefore 
M is bot h b ound ed and closed and by Heine-Borel theorem 
(lAbsil et al.ll2008h . it is compact. 


For the convergence of the C-ADMM algorithm, we 
need to have smooth, convex functions for <;(J). This is 
not always guaranteed but with same assumptions as above, 
most of th e time i t can be safely assumed to be convex (see 
lYatawattal d2012blf for a detailed investigation). 


© 2015 RAS, MNRAS 000. |T][9] 











